Abstract

Latent feature models are widely used to decompose data into a small number of components. 
Bayesian nonparametric variants of these models, which use the Indian buffet process (IBP) as 
a prior over latent features, allow the number of features to be determined from the data. We 
present a generalization of the IBP, the distance dependent Indian buffet process (dd-IBP), for
modeling non-exchangeable data. It relies on distances defined between data points, biasing
nearby data to share more features. The choice of distance measure allows for many kinds of 
dependencies, including temporal and spatial. Further, the original IBP is a special case of the 
dd-IBP. In this paper, we develop the dd-IBP and theoretically characterize its feature-sharing
properties. We derive a Markov chain Monte Carlo sampler for a linear Gaussian model with a 
dd-IBP prior and study its performance on several non-exchangeable data sets. 

KEYWORDS: Bayesian nonparametrics, dimensionality reduction, matrix factorization 

1 Introduction



Many natural phenomena decompose into latent features. For example, visual scenes can be de- 
composed into objects; genetic regulatory networks can be decomposed into transcription factors; 
music can be decomposed into spectral components. In these examples, multiple latent features 
can be simultaneously active, and each can influence the observed data. Dimensionality reduction 
methods, such as principal component analysis, factor analysis, and probabilistic matrix
factorization, provide a statistical approach to inferring latent features. These methods characterize
a small set of dimensions, or features, and model each data point as a weighted combination of 
these features. Dimensionality reduction can improve predictions and identify hidden structures in 
observed data. 

Dimensionality reduction methods typically require that the number of latent features (i.e., the 
number of dimensions) be fixed in advance. Researchers have recently proposed a more flexible 
approach based on Bayesian nonparametric models, where the number of features is inferred from 
the data through a posterior distribution. These models are usually based on the Indian buffet 
process (IBP; [16, 17]), a prior over binary matrices with a finite number of rows (corresponding to
data points) and an infinite number of columns (corresponding to latent features). Using the IBP 
as a building block, Bayesian nonparametric latent feature models have been applied to several 
statistical problems (e.g., pJl [201 EZl ES] ) . Since the number of features is effectively unbounded, 
these models are sometimes known as "infinite" latent feature models. 

The IBP assumes that data are exchangeable: permuting the order of rows leaves the probability 
of a feature matrix unchanged. This assumption may be appropriate for some data sets, but for 
many others we expect dependencies between data points and, consequently, between their latent 
representations. As examples, the latent features describing human motion are autocorrelated 
over time; the latent features describing environmental risk factors are autocorrelated over space. 
In this paper, we present a generalization of the IBP — the distance dependent IBP (dd-IBP) — 
that addresses this limitation. The dd-IBP allows infinite latent feature models to capture non- 
exchangeable structure. 

The problem of adapting nonparametric models to non-exchangeable data has been studied exten- 
sively in the mixture-modeling literature. In particular, variants of the Dirichlet process mixture 
model allow dependencies between data points (e.g., [Hi [151 [71 [121 [2U [1] ) . These dependencies may 
be spatial, temporal or more generally covariate-dependent; the effect of such dependencies is to 
induce sharing of features between nearby data points. 

Among these methods is the distance dependent Chinese restaurant process (dd-CRP; [6]). The 
dd-CRP is a non-exchangeable generalization of the Chinese restaurant process (CRP), the prior 
over partitions of data that emerges in Bayesian nonparametric mixture modeling [51 IT^ 1^ . The 
dd-CRP models non-exchangeability by using distances between data points: nearby data
points (e.g., in time or space) are more likely to be assigned to the same mixture component. The 
dd-IBP extends these ideas to infinite latent feature models, where distances between data points 
influence feature-sharing, and nearby data points are more likely to share latent features.



We review the IBP in Section 2.1 and develop the dd-IBP in Section 2.2. Like the dd-CRP,
the dd-IBP lacks marginal invariance, which means that removing one observation changes the
distribution over the other observations. We discuss this property further in Section 2.3. Although
many Bayesian nonparametric models have this property, we view it as a particular modeling choice 
that may be appropriate for some problems but not for others. 



Several other infinite latent feature models have been developed to capture dependencies between 
data in different ways, for example using phylogenetic trees [21] or latent Gaussian processes [29] . 
Of particular relevance to this work is the model of Zhou et al. [31], which uses a hierarchical 
beta process to couple data. These and other related models are discussed further in Section 3. In
Section 4, we characterize the feature-sharing properties of the dd-IBP and compare them to those of
the model proposed by Zhou et al. [31]. We find that the different models capture qualitatively
distinct dependency structures. 

Exact posterior inference in the dd-IBP is intractable. We present an approximate inference
algorithm based on Markov chain Monte Carlo (MCMC; [27]) in Section 5, and we apply this algorithm
in Section 6 to infer the latent features in a linear-Gaussian model. The experimental results
presented in Section 7 suggest that the dd-IBP is an effective tool for modeling latent structure in
data with dependencies between observations.

2 The distance dependent Indian buffet process 

We first review the definition of the IBP and its role in defining infinite latent feature models. We 
then introduce the dd-IBP. 

2.1 The Indian buffet process 

The IBP is a prior over binary matrices Z with an infinite number of columns [16, 17]. In the
Indian buffet metaphor, rows of Z correspond to customers and columns correspond to dishes. In
data analysis, the customers represent data points and the dishes represent features. Let $z_{ik}$ denote
the entry of Z at row i and column k. Whether customer i has decided to sample dish k (that is,
whether $z_{ik} = 1$) corresponds to whether data point i possesses feature k.

The IBP is defined as a sequential process. The first customer enters the restaurant and samples
the first $\lambda_1 \sim \mathrm{Poisson}(\alpha)$ dishes, where the hyperparameter $\alpha$ is a scalar. In the binary
matrix, this corresponds to the first row being a contiguous block of ones, whose length is the
number of dishes sampled ($\lambda_1$), followed by an infinite block of zeros.

Subsequent customers $i = 2, \ldots, N$ enter, sampling each previously sampled dish according to its
popularity,

$$P(z_{ik} = 1 \mid z_{1:(i-1)}) = m_{ik}/i, \tag{1}$$

where $m_{ik} = \sum_{j<i} z_{jk}$ is the number of customers that sampled dish k prior to customer i. (We
emphasize that Eq. 1 applies only to dishes k that were previously sampled, i.e., for which $m_{ik} > 0$.)
Then, each customer samples $\lambda_i \sim \mathrm{Poisson}(\alpha/i)$ new dishes. Again these are represented as a
contiguous block of ones in the columns beyond the last dish sampled by a previous customer.
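
The sequential construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names `poisson` and `sample_ibp` are hypothetical, and the Poisson sampler is a simple textbook method that is only adequate for small rates.

```python
import math
import random

def poisson(rate, rng):
    """Knuth's multiplication sampler; adequate for the small rates used here."""
    limit, k, p = math.exp(-rate), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def sample_ibp(n_customers, alpha, rng):
    """Draw a binary feature matrix Z (list of rows) by the buffet construction."""
    counts = []  # counts[k] = m_ik, number of earlier customers who sampled dish k
    rows = []    # rows[i] = set of dish indices sampled by customer i
    for i in range(1, n_customers + 1):
        # Previously sampled dishes: take dish k with probability m_ik / i (Eq. 1).
        taken = [k for k, m in enumerate(counts) if rng.random() < m / i]
        # New dishes: Poisson(alpha / i) fresh columns appended on the right.
        n_new = poisson(alpha / i, rng)
        new = list(range(len(counts), len(counts) + n_new))
        for k in taken:
            counts[k] += 1
        counts.extend([1] * n_new)
        rows.append(set(taken + new))
    return [[int(k in row) for k in range(len(counts))] for row in rows]

Z = sample_ibp(8, 2.0, random.Random(1))
```

Because the first customer only takes new dishes, the first row of `Z` is a block of ones followed by zeros, as described in the text.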

Figure 1: Schematic of the IBP. An example of a latent feature matrix (Z) generated by the
IBP. Rows correspond to customers (data points) and columns correspond to dishes (features).
Gray shading indicates that a feature is active for a given data point. The last row illustrates
the assignment process for a new customer; the counts for each feature ($m_{ik}$) are shown inside the
circles for previously sampled features.

Though described sequentially, Griffiths and Ghahramani [16] showed that the resulting rows of 
the binary matrix are exchangeable (up to a permutation of the columns). This means that the 
order of the customers does not affect the probability of the resulting binary matrix. This is seen 
in the Beta-Bernoulli perspective, which we review in Section 3. In the next section, we develop a
generalization of the IBP that relaxes this assumption. 

2.2 The distance dependent Indian buffet process 

Like the IBP, the dd-IBP is a distribution over binary latent feature matrices with a finite number 
of rows and an infinite number of columns. Each pair of customers has an associated distance, e.g., 
distance in time or space, or based on a covariate. Two customers that are close together in this 
distance will be more likely to share the same dishes (that is, features) than two customers that 
are far apart. 

The dd-IBP can be understood in terms of the following sequential construction. First, each 
customer selects a Poisson-distributed number of dishes (feature columns). The dishes selected by 
a customer in this phase of the construction are said to be "owned" by this customer. A dish is 
either unowned, or is owned by exactly one customer. This step is akin to the selection of new 
dishes in the IBP. 

Then, for each owned dish, customers connect to one another. The probability that one customer 
connects to another decreases in the distance between them. Note that customers do not sample 
each dish, as in the IBP, but rather connect to other customers. 

Finally, dish inheritance is computed: A customer inherits a dish if its owner (from the first step)
is reachable in the connectivity graph for that dish. This inheritance is computed deterministically
from the connections generated in the previous step.^1 The dishes that each customer samples are
those that he inherits or owns. Thus, similarity of sampled dishes between nearby customers is
induced via distance-dependent connection probabilities.

We now more formally describe the probabilistic generative process of the binary matrix Z. First, 
we introduce some notation and terminology. 

• Dishes (columns of Z) are identified with the natural numbers $\mathbb{N} = \{1, 2, \ldots\}$. The set of
dishes owned by customer i is $\mathcal{K}_i \subset \mathbb{N}$. The cardinality of this set is $\lambda_i = |\mathcal{K}_i|$. These sets are
disjoint, so $\mathcal{K}_i \cap \mathcal{K}_j = \emptyset$ for $i \neq j$. The total number of owned dishes is $K = \sum_{i=1}^{N} \lambda_i$. The
set of dishes owned by customers excluding i is $\mathcal{K}_{-i} = \cup_{j \neq i} \mathcal{K}_j$.

• Each dish is associated with a set of customer-to-customer assignments, specified by the $N \times K$
connectivity matrix C, where $c_{ik} = j$ indicates that customer i connects to customer j for
dish k. Given C, the customers form a set of (possibly cyclic) directed graphs, one for each
dish. The ownership vector is $c^*$, where $c^*_k \in \{1, \ldots, N\}$ indicates the customer who owns
dish k, so $c^*_k = i \iff k \in \mathcal{K}_i$.

• The $N \times N$ distance matrix between customers is D, where the distance between customers
i and j is $d_{ij}$. A customer's self-distance is 0: $d_{ii} = 0$. We call the distance matrix sequential
when $d_{ij} = \infty$ for $j > i$. In this special case, customers can only connect to previous
customers.

• The decay function $f : \mathbb{R} \to [0, 1]$ maps distance to a quantity, which we call proximity, that
controls the probabilities of customer links. We require that $f(0) = 1$ and $f(\infty) = 0$. We
obtain the normalized proximity matrix A by applying the decay function to each customer
pair and normalizing by customer. That is, $a_{ij} = f(d_{ij})/h_i$, where $h_i = \sum_{j=1}^{N} f(d_{ij})$.
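
As a small illustration of the last item, the normalized proximity matrix can be computed row by row. The sketch below assumes a distance matrix given as a list of lists; the function name `normalized_proximity` and the three example positions are hypothetical choices made only for this example.

```python
import math

def normalized_proximity(D, f):
    """Return (A, h) with a_ij = f(d_ij) / h_i and h_i = sum_j f(d_ij)."""
    A, h = [], []
    for row in D:
        weights = [f(d) for d in row]
        total = sum(weights)  # h_i; positive because f(d_ii) = f(0) = 1
        h.append(total)
        A.append([w / total for w in weights])
    return A, h

# Hypothetical example: three customers at positions 0, 1 and 3 on a line,
# with an exponential decay f(d) = exp(-d).
positions = [0.0, 1.0, 3.0]
D = [[abs(p - q) for q in positions] for p in positions]
A, h = normalized_proximity(D, lambda d: math.exp(-d))
```

Each row of `A` sums to one, so row i can be read as the distribution over the customers that customer i may connect to.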

Using this notation, we generate the feature indicator matrix Z as follows: 

1. Assign dish ownership. For each customer i, allocate $\lambda_i \sim \mathrm{Poisson}(\alpha/h_i)$ unowned dishes
to the customer's set of owned dishes, $\mathcal{K}_i$. For each $k \in \mathcal{K}_i$, set the ownership $c^*_k = i$.

2. Assign customer connections. For each customer i and dish $k \in \{1, \ldots, K\}$, draw a
customer assignment according to $P(c_{ik} = j \mid D, f) = a_{ij}$, $j = 1, \ldots, N$. Note that customers
can connect to themselves. In this case, they do not inherit a dish unless they own it (see the
next step).

3. Compute dish inheritance. We say that customer j inherits dish k if there exists a path
along the directed graph for dish k from customer j to the dish's owner $c^*_k$ (i.e., if $c^*_k$ is
reachable from j), where the directed graph is defined by column k of C. The owner of a dish
automatically inherits it.^2 We encode reachability with $\mathcal{L}$. If customer j is reachable from
customer i for dish k then $\mathcal{L}_{ijk} = 1$. Otherwise $\mathcal{L}_{ijk} = 0$.

4. Compute the feature indicator matrix. For each customer i and dish k we set $z_{ik} = 1$
if i inherits k, otherwise $z_{ik} = 0$.

^1 If one insists upon a complete gastronomical metaphor, customer connectivity can be thought of as "I'll have
what he's having."

Figure 2: Schematic of the dd-IBP. An example of a latent feature matrix generated by the
dd-IBP. Rows correspond to customers (data points) and columns correspond to dishes (features).
Customers connect to each other, as indicated by arrows. Customers inherit a dish if the owner of
that dish ($c^*_k$, indicated by stars) is reachable by a sequence of connections. Gray shading indicates
that a feature is active for a given data point.

An example of customer assignments sampled from the dd-IBP is shown in Figure 2. In this
example, customer 1 owns dish 1; customers 2-4 all reach customer 1, either directly or through a 
chain, and thereby inherit the dish (indicated by gray shading). Consequently, feature 1 is active 
for customers 1-4. Dish 2 is owned by customer 2; only customer 1 reaches customer 2, and hence 
feature 2 is active for customers 1 and 2. Dish 3 is owned by customer 2, but no other customers 
reach customer 2, and hence feature 3 is active only for that customer. 
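
The generative process in steps 1 to 4 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `poisson` and `sample_dd_ibp` are hypothetical, customers are indexed from 0, and the decay function `f` is assumed to return 0 for infinite distances. Since each customer has exactly one outgoing link per dish, reachability is checked by following links until the owner or a repeated customer is met.

```python
import math
import random

def poisson(rate, rng):
    """Knuth's multiplication sampler; fine for the small rates alpha / h_i."""
    limit, k, p = math.exp(-rate), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def sample_dd_ibp(D, f, alpha, rng):
    """Return (Z, owner): the binary feature matrix and the ownership vector c*."""
    n = len(D)
    weights = [[f(d) for d in row] for row in D]  # unnormalized proximities f(d_ij)
    h = [sum(row) for row in weights]             # normalizers h_i
    # Step 1: ownership. owner[k] is the customer who owns dish k.
    owner = []
    for i in range(n):
        owner.extend([i] * poisson(alpha / h[i], rng))
    Z = [[0] * len(owner) for _ in range(n)]
    for k, star in enumerate(owner):
        # Step 2: for dish k, customer i links to customer j with probability a_ij.
        link = [rng.choices(range(n), weights=weights[i])[0] for i in range(n)]
        for j in range(n):
            # Step 3: follow links from j until the owner or a cycle is reached.
            current, seen = j, set()
            while current != star and current not in seen:
                seen.add(current)
                current = link[current]
            # Step 4: z_jk = 1 if j owns or inherits dish k.
            Z[j][k] = int(current == star)
    return Z, owner
```

With a sequential distance matrix, links only point to earlier customers or to the customer itself, so no customer can hold a dish owned by a later customer.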

The generative process of the dd-IBP defines the following joint distribution of the ownership vector 
and connectivity matrix, 

$$P(C, c^* \mid D, \alpha, f) = P(c^* \mid \alpha)\, P(C \mid c^*, D, f). \tag{2}$$

Consider the first term. Recall that the set of dishes each customer owns $\mathcal{K}_i$ and the total number
of owned dishes K are both functions of the ownership vector $c^*$. Thus, the probability of the
ownership vector is

$$P(c^* \mid \alpha) = \prod_{i=1}^{N} P(\lambda_i \mid \alpha), \tag{3}$$

where $c^*$ is a deterministic function of $\lambda_1, \ldots, \lambda_N$.

^2 Although customer i can link to other customers for dish k even if $k \in \mathcal{K}_i$, these connections are ignored in
determining dish inheritance when $k \in \mathcal{K}_i$.

Figure 3: Decay functions. Each panel presents a different latent feature matrix, sampled from
the dd-IBP with sequential distances. Decay functions are shown in the insets. 

Consider the second term. The conditional distribution of the connectivity matrix C depends 
on the total number of owned dishes and the normalized proximity matrix A (derived from the 
distances and decay function), 

$$P(C \mid c^*, D, f) = \prod_{i=1}^{N} \prod_{k=1}^{K} a_{i c_{ik}}. \tag{4}$$

The dependence on c* comes from K, which is determined by c*. 

Random feature models (and the traditional IBP) operate with a random binary matrix Z. In the 
dd-IBP, Z is a (deterministic) many-to-one function of the random variables C and c*, which we 
denote by $\phi$. We compute the probability of a binary matrix by marginalizing out the appropriate
configurations of these variables,

$$P(Z \mid D, \alpha, f) = \sum_{(c^*, C)\,:\,\phi(c^*, C) = Z} P(c^*, C \mid D, \alpha, f). \tag{5}$$

The dd-IBP reduces to the standard IBP in the special case when $f(d) = 1$ for all $d < \infty$ and the
distance matrix is sequential. (Recall: D is sequential if $d_{ij} = \infty$ for $j > i$.) To see this, consider
the probability that the kth dish is sampled by the ith customer (that is, $z_{ik} = 1$). This probability
is the proportion of previous customers that already reach the dish's owner $c^*_k$, because the probability
of connecting to each customer is proportional to one. This probability is $m_{ik}/i$, which is the same as in the
IBP. This is akin to the relationship between the dd-CRP and the traditional CRP under the same
condition.
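
The reduction can be checked directly from the definitions of the normalizer and the normalized proximity given in Section 2.2. The short calculation below is a sketch that uses only those definitions, together with the requirement that the decay function vanishes at infinite distance.

```latex
% Constant decay f(d) = 1 for d < \infty, f(\infty) = 0,
% and sequential distances d_{ij} = \infty for j > i.
\begin{align*}
h_i &= \sum_{j=1}^{N} f(d_{ij}) = \sum_{j \le i} 1 = i, \\
a_{ij} &= \frac{f(d_{ij})}{h_i} =
  \begin{cases} 1/i & \text{if } j \le i, \\ 0 & \text{if } j > i, \end{cases} \\
\lambda_i &\sim \mathrm{Poisson}(\alpha / h_i) = \mathrm{Poisson}(\alpha / i), \\
P(z_{ik} = 1 \mid z_{1:(i-1)}) &= \sum_{j < i} a_{ij}\, z_{jk} = \frac{m_{ik}}{i}
  \quad \text{for a dish } k \text{ owned by an earlier customer.}
\end{align*}
```

The last line holds because customer i links to exactly one of customers 1 through i, each with probability 1/i, and inherits the dish exactly when the chosen customer already holds it; a self-link does not lead to the owner.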

Many different decay functions are possible within this framework. Figure 3 shows samples of
Z using four decay functions and a sequential distance defined by absolute temporal distance
($d_{ij} = i - j$ for $i > j$ and $d_{ij} = \infty$ for $j > i$).

• The constant, $f(d) = 1$. This is the standard IBP.

• The exponential, $f(d) = \exp(-\beta d)$.

• The logistic, $f(d) = 1/(1 + \exp(\beta d - \nu))$.

• The window, $f(d) = \mathbb{1}[d < \nu]$.

Each decay function encourages the sharing of features across nearby rows in a different way. 
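
To make the comparison concrete, the four decay functions can be written as small Python functions and applied to one row of a sequential distance matrix. This is a sketch under assumptions: the function names and default parameter values are hypothetical, and the logistic is taken to be $f(d) = 1/(1 + \exp(\beta d - \nu))$, which is an assumed reading of the formula in the list above.

```python
import math

def constant_decay(d):
    # f(d) = 1 for finite d; f(inf) = 0 rules out links to later customers.
    return 1.0 if d < math.inf else 0.0

def exponential_decay(d, beta=1.0):
    # math.exp(-inf) evaluates to 0.0, so f(inf) = 0.
    return math.exp(-beta * d)

def logistic_decay(d, beta=1.0, nu=5.0):
    # Assumed form: f(d) = 1 / (1 + exp(beta * d - nu)).
    x = beta * d - nu
    return 0.0 if x > 700 else 1.0 / (1.0 + math.exp(x))  # guard against overflow

def window_decay(d, nu=3.0):
    return 1.0 if d < nu else 0.0

def proximity_row(i, f):
    """Link probabilities a_ij for customer i under d_ij = i - j, j = 0, ..., i."""
    weights = [f(i - j) for j in range(i + 1)]
    total = sum(weights)
    return [w / total for w in weights]

for f in (constant_decay, exponential_decay, logistic_decay, window_decay):
    print(f.__name__, [round(a, 3) for a in proximity_row(4, f)])
```

The constant decay spreads the link probability evenly over earlier customers, the window restricts it to the most recent few, and the exponential and logistic decays concentrate it on recent customers more smoothly.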

When combined with an observation model (which specifies how the latent features give rise to 
observed data), the dd-IBP functions as a prior over latent feature representations of a data set. 
In Section 6, we consider a specific example of how the dd-IBP can be used to analyze data.



2.3 Marginal invariance and exchangeability

Unlike the traditional IBP, the dd-IBP is not (in general) marginally invariant, the property that 
removing a customer leaves the distribution over latent features for the remaining customers un- 
changed. (The dd-IBP builds on the dd-CRP, which is not marginally invariant either.) In some 
circumstances, marginal invariance is desirable for computational reasons. For example, the con- 
ditional distributions over missing data for models lacking marginal invariance require computing 
ratios of normalization constants. In contrast, marginally invariant models, due to their factorized 
structure, require less computation for conditional distributions over missing data. In other circum- 
stances, such as exploratory analysis of fully observed datasets, this computational concern is less 
important. Beyond computational considerations, while marginal invariance may be an appropriate 
modeling assumption in some data sets, it may be inappropriate in others. 

Also unlike the traditional IBP, the dd-IBP is not exchangeable in general. To state this formally,
let $\pi$ be a permutation of the integers $\{1, \ldots, N\}$, and for a given $N \times \infty$ binary matrix $z$, let $z^\pi$
be the matrix created by permuting its rows according to $\pi$. Let Z be drawn from the dd-IBP with
distance matrix D, mass parameter $\alpha$ and decay function $f$. Then, except in certain special cases
(such as when D recovers the traditional IBP),

$$P(Z = z \mid D, \alpha, f) \neq P(Z = z^\pi \mid D, \alpha, f).$$

Permuting the data changes its distribution, and so the dd-IBP is not exchangeable in general. 

Although the dd-IBP is not exchangeable, it does have a related symmetry. Let $D^\pi$ be the $N \times N$
matrix D with both its rows and its columns permuted according to $\pi$, and let $Z^\pi$ be drawn from
the dd-IBP with distance matrix $D^\pi$ rather than D. (We retain the same values for $\alpha$ and $f$.)
Then, in general,

$$P(Z = z \mid D, \alpha, f) = P(Z^\pi = z^\pi \mid D^\pi, \alpha, f).$$

Thus, if we permute both the data and the distance matrix, probabilities remain unchanged. Per- 
muting both the data and the distance matrix is like first relabeling the data, and then explicitly 
altering the probability distribution to account for this relabeling. If the dd-IBP were exchangeable,
one would not need to alter the probability distribution to account for relabeling. 



3 Related work 



In this section we describe related work on infinite latent feature models that capture external 
dependence between the data. We focus on the most closely related model, which is the dependent
hierarchical beta process (dHBP; [31]). As a prelude to describing the dHBP, we review the
connection between the IBP and the beta process. 



3.1 The beta process 

Recall that the IBP is exchangeable. Consequently, by de Finetti's theorem [3], the rows of Z,
considered as binary vectors $\{z_i\}$, are conditionally independent,

$$P(Z) = \int \prod_{i=1}^{N} P(z_i \mid B)\, dP(B). \tag{6}$$

In this marginal distribution, B is a random measure on the feature space $\Omega$ and $P(B)$ is the de
Finetti mixing distribution (see [H]). Thibaux and Jordan [28] showed that the de Finetti mixing
distribution underlying the IBP is the beta process (BP), parameterized by a positive concentration
parameter c and a base measure $B_0$ on $\Omega$. A draw $B \sim \mathrm{BP}(c, B_0)$ is defined by a countably infinite
collection of weighted atoms,

$$B = \sum_{k=1}^{\infty} p_k \delta_{\omega_k}, \tag{7}$$

where $\delta_{\omega}$ is a probability distribution that places a single atom at $\omega \in \Omega$, and the $p_k \in [0, 1]$ are
independent random variables whose distribution is described as follows. If $B_0$ is continuous, then
the atoms and their weights are drawn from a nonhomogeneous Poisson process defined on the
space $\Omega \times [0, 1]$ with rate measure

$$\nu(d\omega, dp) = c\, p^{-1} (1 - p)^{c-1}\, dp\, B_0(d\omega). \tag{8}$$

If $B_0$ is discrete and of the form $B_0 = \sum_{k=1}^{\infty} q_k \delta_{\omega_k}$, $q_k \in [0, 1]$, then B has atoms at the same
locations as $B_0$, with $p_k \sim \mathrm{Beta}(c q_k, c(1 - q_k))$. Following Thibaux and Jordan [28], we define the
mass parameter as $\gamma = B_0(\Omega)$. Note that $B_0$ is not necessarily a probability measure, and hence $\gamma$
can take on non-negative values different from 1.

Conditional on a draw from the beta process, the feature representation $X_i$ of data point i is
generated by drawing from the Bernoulli process (BeP) with base measure B: $X_i \mid B \sim \mathrm{BeP}(B)$.
If B is discrete, then $X_i = \sum_{k=1}^{\infty} z_{ik} \delta_{\omega_k}$, where $z_{ik} \sim \mathrm{Bernoulli}(p_k)$. In other words, feature k is
activated with probability $p_k$ independently for all data points. Sampling Z from the compound
beta-Bernoulli process is equivalent to sampling Z directly from the IBP when $c = 1$ and $\gamma = \alpha$
[28].






3.2 Dependent hierarchical beta processes 

The dHBP [31] builds external dependence between data points using the above BP construction.
The dependencies are induced by mixing independent BP random measures, weighted by their 
proximities A. 

The dHBP is based on the following generative process, 

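$$X_i \mid B^*_{g_i} \sim \mathrm{BeP}(B^*_{g_i}), \qquad g_i \sim \mathrm{Multinomial}(a_i),$$

$$B^*_j \mid B \sim \mathrm{BP}(c_1, B), \qquad B \sim \mathrm{BP}(c_0, B_0). \tag{9}$$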

This is equivalent to drawing Xi from a Bernoulli process whose base measure is a linear combination 
of BP random measures, 

$$X_i \mid B_i \sim \mathrm{BeP}(B_i), \qquad B_i = \sum_{j=1}^{N} a_{ij} B^*_j. \tag{10}$$

Dependencies between data points are captured in the dHBP by the proximity matrix A, as in the 
dd-IBP.^3 This allows proximal data points (e.g., in time or space) to share more latent features
than distant ones. 

In Section 4, we compare the feature-sharing properties of the dHBP and dd-IBP. Using an asymptotic
analysis, we show that the dd-IBP offers more flexibility in modeling the proportion of features
shared between data points, but less flexibility in modeling uncertainty about these proportions. 

3.3 Other non-exchangeable variants 

Although still a nascent area of research, several other non-exchangeable priors for infinite latent 
feature models have been proposed. Williamson, Orbanz and Ghahramani [29] used a hierarchical
Gaussian process to couple the latent features of data in a covariate-dependent manner. They 
named this model the dependent Indian buffet process (dIBP). Their framework is flexible: It can 
couple columns of Z in addition to rows, while the dd-IBP cannot. However, this flexibility comes 
at a computational cost during inference: Their algorithm requires sampling an extra layer of 
variables. 

Miller, Griffiths and Jordan [21] proposed a "phylogenetic IBP" that encodes tree-structured
dependencies between data. Doshi-Velez and Ghahramani [11] proposed a "correlated IBP" that
couples data points and features through a set of latent clusters. Both of these models relax ex- 
changeability, but they do not allow dependencies to be specified directly in terms of distances 
between data. Furthermore, inference for these models requires more intensive computation than 
does the standard IBP. The MCMC algorithm presented by Miller et al. [21] for the phylogenetic
IBP involves both dynamic programming and auxiliary variable sampling. Similarly, the MCMC 
algorithm for the correlated IBP involves sampling latent clusters in addition to latent features. 

^3 Zhou et al. [31] formalize dependencies in an equivalent manner using a normalized kernel function defined over
pairs of covariates associated with the data points.

Our model also incurs extra computational cost relative to the traditional IBP due to the
computation of reachability (quadratic in the number of observations); however, it permits a richer
specification of the dependency structure between observations than either the phylogenetic or the 
correlated IBP. 



Recently, Ren et al. [26] presented a novel way of introducing dependency into latent feature models
based on the beta process. Instead of defining distances between customers, each dish is associated
with a latent covariate vector, and distances are defined between each customer's (observed) covari- 
ates and the dish-specific covariates. Customers then choose dishes with probability proportional 
to the customer-dish proximity. This construction comes with a significant computational advan- 
tage for data sets where the time complexity is tied predominantly to the number of observations. 
The downside of this construction is that the MCMC algorithm used for inference must sample a 
separate covariate vector for each dish, which may scale poorly if the covariate dimensionality is 
high. 

4 Characterizing feature-sharing 

In this section, we compare the feature-sharing properties of the dHBP and dd-IBP. Two data points 
share a feature if that feature is active for both (i.e., $z_{ik} = z_{jk} = 1$ for $i \neq j$ and a given feature k).
This analysis is useful for understanding the types of dependencies induced by the different models, 
and can help guide the choice of model and hyperparameter settings for particular data analysis 
problems. We consider an asymptotic regime in which the mass parameter is large ($\alpha$ for the
dd-IBP and $\gamma$ for the dHBP), which simplifies feature-sharing properties. Proofs of all propositions in
this section may be found in the Appendix. 

4.1 Feature-sharing in the dd-IBP

We first characterize the limiting distributional properties of feature-sharing in the dd-IBP as
$\alpha \to \infty$. We drop the feature index k in the reachability indicator $\mathcal{L}_{ijk}$, writing it $\mathcal{L}_{ij}$. We do this
because features (that is, columns of the binary matrix Z) are exchangeable under the dd-IBP and,
consequently, the distribution of the random vector $(\mathcal{L}_{ijk} : i, j = 1, \ldots, N)$ is invariant across k.

Let $R_i = \sum_{k=1}^{\infty} z_{ik}$ denote the number of features held by data point i, and let $R_{ij} = \sum_{k=1}^{\infty} z_{ik} z_{jk}$
denote the number of features shared by data points i and j, where $i \neq j$.

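Proposition 1. Under the dd-IBP,

$$R_i \sim \mathrm{Poisson}\Big(\alpha \sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1)\Big), \tag{11}$$

$$R_{ij} \sim \mathrm{Poisson}\Big(\alpha \sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1, \mathcal{L}_{jn} = 1)\Big). \tag{12}$$

The probabilities $P(\mathcal{L}_{in} = 1)$ and $P(\mathcal{L}_{in} = 1, \mathcal{L}_{jn} = 1)$ depend strongly on the distribution of the
connectivity matrix C, but do not depend on the ownership vector $c^*$, since C is independent of
dish ownership.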

We derive the limiting properties of $R_i$ and $R_{ij}$ from properties of the Poisson distribution. In this
and following results, $\xrightarrow{d}$ indicates convergence in distribution.

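Corollary 1. Let $i \neq j$. $R_i$ and $R_{ij}$ converge in distribution under the dd-IBP to the following
constants as $\alpha \to \infty$:

$$\frac{R_i}{\alpha} \xrightarrow{d} \sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1), \tag{13}$$

$$\frac{R_{ij}}{\alpha} \xrightarrow{d} \sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1, \mathcal{L}_{jn} = 1), \tag{14}$$

$$\frac{R_{ij}}{R_i} \xrightarrow{d} \frac{\sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1, \mathcal{L}_{jn} = 1)}{\sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1)}. \tag{15}$$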

This corollary shows that the limiting fraction of shared features Rij/Ri in the dd-IBP is a constant 
that may be different for each pair of data points i and j. In contrast, we show below that the 
same limiting fraction under the dHBP is random, and takes one of two values. These two values 
are fixed, and do not depend upon the data points i and j. 



4.2 Feature-sharing in the dHBP 

Here we characterize the limiting distributional properties of feature sharing in the dHBP as $B_0$
becomes infinitely concentrated (i.e., $\gamma \to \infty$, analogous to $\alpha \to \infty$). In this limit, feature-sharing
is primarily attributable to dependency induced by the proximity matrix A.

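Proposition 2. If $B_0$ is continuous, then under the dHBP,

$$R_i \mid g_{1:N} \sim \mathrm{Poisson}(\gamma), \tag{16}$$

$$R_{ij} \mid g_{1:N} \sim
\begin{cases}
\mathrm{Poisson}\Big(\gamma\, \dfrac{c_0 + c_1 + 1}{(c_0 + 1)(c_1 + 1)}\Big) & \text{if } g_i = g_j, \\[2ex]
\mathrm{Poisson}\Big(\dfrac{\gamma}{c_0 + 1}\Big) & \text{if } g_i \neq g_j.
\end{cases} \tag{17}$$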



We derive the limiting properties of Ri and Rij from properties of the Poisson distribution. 

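Corollary 2. Let $i \neq j$. Conditional on $g_{1:N}$, $R_i$ and $R_{ij}$ converge in distribution under the dHBP
to the following constants as $\gamma \to \infty$:

$$\frac{R_i}{\gamma} \xrightarrow{d} 1, \tag{18}$$

$$\frac{R_{ij}}{\gamma} \xrightarrow{d}
\begin{cases}
\dfrac{c_0 + c_1 + 1}{(c_0 + 1)(c_1 + 1)} & \text{if } g_i = g_j, \\[2ex]
\dfrac{1}{c_0 + 1} & \text{if } g_i \neq g_j,
\end{cases} \tag{19}$$

$$\frac{R_{ij}}{R_i} \xrightarrow{d}
\begin{cases}
\dfrac{c_0 + c_1 + 1}{(c_0 + 1)(c_1 + 1)} & \text{if } g_i = g_j, \\[2ex]
\dfrac{1}{c_0 + 1} & \text{if } g_i \neq g_j.
\end{cases} \tag{20}$$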

Thus, the expected fraction of object i's features shared with object j, $R_{ij}/R_i$, is a factor of $(c_0 + c_1 + 1)/(c_1 + 1)$
bigger when $g_i = g_j$. As $c_0 \to \infty$, this factor goes to $\infty$. As $c_0 \to 0$, it goes to 1. We can obtain
the unconditional fraction by marginalizing over $g_i$ and $g_j$:

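Corollary 3. Let $i \neq j$. $R_{ij}/R_i$ converges in distribution under the dHBP as $\gamma \to \infty$ to a random
variable $M_{ij}$ defined by

$$M_{ij} =
\begin{cases}
\dfrac{c_0 + c_1 + 1}{(c_0 + 1)(c_1 + 1)} & \text{with probability } P(g_i = g_j), \\[2ex]
\dfrac{1}{c_0 + 1} & \text{with probability } P(g_i \neq g_j),
\end{cases} \tag{21}$$

where $P(g_i = g_j) = \sum_{n=1}^{N} a_{in} a_{jn}$.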



This corollary shows that as $\gamma$ grows large, the fraction of shared features becomes one of two values
(determined by $c_0$ and $c_1$), with a mixing probability determined by the dependency structure.
Thus, the dHBP affords substantial flexibility in specifying the mixing probability (via A), but is 
constrained to two possible values of the limiting fraction. 
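
The two-point limit in Corollary 3 is simple to evaluate numerically. The sketch below assumes a normalized proximity matrix `A` given as a list of rows; the function name `dhbp_limiting_fraction` and the example values are hypothetical.

```python
def dhbp_limiting_fraction(A, i, j, c0, c1):
    """Two-point limiting distribution of R_ij / R_i under the dHBP.

    Returns a list of (value, probability) pairs.
    """
    # P(g_i = g_j) = sum_n a_in * a_jn, since g_i and g_j are drawn independently.
    p_same = sum(a_in * a_jn for a_in, a_jn in zip(A[i], A[j]))
    same = (c0 + c1 + 1.0) / ((c0 + 1.0) * (c1 + 1.0))  # value when g_i = g_j
    diff = 1.0 / (c0 + 1.0)                             # value when g_i != g_j
    return [(same, p_same), (diff, 1.0 - p_same)]

# Hypothetical two-point example with c0 = c1 = 1.
A = [[0.5, 0.5], [0.25, 0.75]]
print(dhbp_limiting_fraction(A, 0, 1, 1.0, 1.0))
```

Changing `A` only moves probability between the two values; the values themselves depend on $c_0$ and $c_1$ alone, which is the constraint discussed in the text.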



4.3 Feature-sharing in the IBP 



For comparison, we briefly describe the feature-sharing properties under the traditional IBP. 

Under the traditional IBP, by exchangeability, $R_i$ and $R_{ij}$ are equal in distribution to $R_1$ and
$R_{12}$. The first customer draws a Poisson($\alpha$) number of dishes. The second customer then chooses
whether to sample each of these dishes independently and with probability 1/2. Thus, the number
of dishes sampled by both the first and second customers is $R_{12} \sim \mathrm{Poisson}(\alpha/2)$.

This shows that, under the traditional IBP, as $\alpha \to \infty$ with $i \neq j$,

$$\frac{R_i}{\alpha} \xrightarrow{d} 1, \qquad \frac{R_{ij}}{\alpha} \xrightarrow{d} \frac{1}{2}, \qquad \frac{R_{ij}}{R_i} \xrightarrow{d} \frac{1}{2}.$$
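
A quick simulation of the first two IBP customers illustrates these limits. This is a sketch under assumptions: the helper names `poisson` and `simulate_sharing` are hypothetical, and a moderate mass parameter is used because the simple Poisson sampler below is not suited to very large rates.

```python
import math
import random

def poisson(rate, rng):
    """Knuth's multiplication sampler; usable for moderate rates such as 100."""
    limit, k, p = math.exp(-rate), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_sharing(alpha, n_trials, rng):
    """Average R_1/alpha, R_12/alpha and R_12/R_1 over independent two-customer draws."""
    sums = [0.0, 0.0, 0.0]
    for _ in range(n_trials):
        r1 = poisson(alpha, rng)                          # dishes of customer 1
        r12 = sum(rng.random() < 0.5 for _ in range(r1))  # each kept with probability 1/2
        sums[0] += r1 / alpha
        sums[1] += r12 / alpha
        sums[2] += r12 / r1 if r1 > 0 else 0.0
    return [s / n_trials for s in sums]

print(simulate_sharing(100.0, 2000, random.Random(0)))
```

The three averages should lie close to 1, 1/2 and 1/2, and the spread of the individual ratios shrinks as the mass parameter grows.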



4.4 Discussion 



Using an asymptotic analysis, the preceding theoretical results show that the dd-IBP and dHBP 
provide different forms of flexibility in specifying the way in which features are shared between 
data points. This asymptotic analysis takes the limit as the mass parameters $\alpha$ and $\gamma$ become
large. This limit is taken for theoretical tractability, and removes much of the uncertainty that is 
otherwise present in these models. While such limiting dd-IBP and dHBP models are not intended 
for practical use, their simplicity provides insight into behavior in non-asymptotic regimes. 

Under the dd-IBP, Corollary 1 shows that the modeler is allowed a great deal of flexibility in
specifying the proportions of features shared by data points. Given a matrix specifying the proportion
of features that are believed to be shared by pairs of data points, one can (if this matrix is
sufficiently well-behaved) design a distance matrix that causes the dd-IBP to concentrate on the desired
proportions. While the dd-IBP cannot model an arbitrary modeler-specified matrix of proportions,
the set of matrices that can be modeled is very large.

Figure 4: Feature-sharing in the dHBP and dd-IBP, limiting case. Along the horizontal
axis, we show 4 independent draws from the dHBP (Top) and dd-IBP (Bottom). Within each
subfigure, the shade of a cell shows the fraction $R_{ij}/R_i$, where $R_i$ is the number of features
held by data point i, and $R_{ij}$ is the number held by both i and j. Diagonals $R_{ii}/R_i = 1$ have been
set to 0 for clarity. Here, $\alpha = \gamma = 1000$. Limiting results from Section 4 explain the behavior for
such large $\alpha$ and $\gamma$: for the dHBP the feature-sharing proportion $R_{ij}/R_i$ is random and equal to
one of two constants; for the dd-IBP the proportion is non-random and takes a range of values.
The dd-IBP models feature-sharing proportions that differ across data points, but does not model
uncertainty about these proportions when mass parameters are large.

In contrast, under the dHBP, Corollary 3 shows that the modeler has less flexibility in specifying
the proportions of features shared. Under the dHBP, the modeler chooses two values,
(c0 + c1 + 1)/((c0 + 1)(c1 + 1)) and 1/(c0 + 1), and the proportion of features shared by any pair
of data points in the asymptotic regime must be one of these two values.



Section 4.3 shows that the traditional IBP has the least flexibility. In the asymptotic regime, the
proportion of features shared by each pair of data points is a constant.

While the dd-IBP has more flexibility in specifying values of the feature-sharing proportions than
the dHBP, it has less flexibility (at least in this asymptotic regime) in modeling uncertainty about 
these feature-sharing proportions. Under the dd-IBP, the proportion of features shared by a pair of 
data points in the asymptotic regime is a deterministic quantity. Under the dHBP, the proportion 
of features shared is a random quantity, even in the asymptotic regime. A modeler using the dHBP 
has full flexibility in choosing the joint probability distribution governing these proportions. One 
could extend the dd-IBP to allow uncertainty about the feature-sharing proportions by specifying
a hyperprior over distance matrices, but we do not consider this extension further.

Figure 5: Feature-sharing in the dHBP and dd-IBP. Heatmaps of the probability mass
function over the number of shared features Rij (y-axis) as a function of proximity aij (x-axis) in a
data set consisting of two data points. Black indicates a probability mass of 0, with lighter shades
indicating larger values. For the dHBP, we set c0 = 10 and c1 = 1. Note that E[Ri] is the same for
both the dHBP and dd-IBP in these examples (when α = γ).



Figure 4 illustrates the difference in asymptotic feature-sharing behavior between the dHBP and
dd-IBP. Subfigures in the upper row are draws from the dHBP, and subfigures in the bottom row are
draws from the dd-IBP. Within a single subfigure, the shade in the cell (i, j) is the fraction Rij/Ri.
(The diagonals Rii/Ri = 1 have been set to 0 to bring out other aspects of the matrix.) Each of
the four columns represents a pair of independent draws. To approximate the asymptotic regime
considered by the theory, the mass parameters for the two models are set to large values of γ = α =
1000. The figure shows that, in draws from the dHBP, off-diagonal cells have one of two shades,
corresponding to the two possible limiting values for Rij/Ri. In the different columns, corresponding
to different independent draws, the patterns are different, showing that Rij/Ri remains random
under the dHBP, even in the asymptotic regime. In contrast, in draws from the dd-IBP, off-diagonal
cells take a variety of different values, but remain unchanged across independent draws.

Figure 5 illustrates non-asymptotic feature-sharing behavior in a simple setting with only two data
points. The figure shows the feature-sharing behavior of the dHBP (top) and dd-IBP (bottom)
at two values for the mass parameter: α = γ = 15 (top row) and α = γ = 30 (bottom row).
Each subfigure shows the probability mass function P(Rij) as a function of the proximity aij, where
aij = 1/dij for the dd-IBP. Because there are only two data points, with aii = 1 and aij = aji,
specifying aij is sufficient for specifying the full proximity matrix A. For the dHBP, we set c0 = 10
and c1 = 1. Also facilitating comparison, E[Ri] is the same between both models (when α = γ).

Figure 5 shows that as the proximity aij increases to 1, the number of shared features Rij tends
to increase under both models. More precisely, P(Rij) concentrates on larger values of Rij as aij
increases. However, the way in which the probability mass functions change with aij differs between
the two models. In the dd-IBP, the most likely value of Rij increases smoothly, while under the
dHBP it remains roughly constant and then jumps. As one varies aij across its full range, the set
of most likely values for Rij under the dd-IBP spans its full range from 0 to 20, while under the
dHBP the most likely value for Rij takes only a few values. Instead, varying aij under the dHBP
allows a variety of bimodal distributions centered near the values from the asymptotic analysis.

This difference in non-asymptotic behaviors mirrors the difference between the two models in the 
asymptotic regime, where the dd-IBP allows feature-sharing proportions to be specified almost
arbitrarily but allows little flexibility in modeling uncertainty about them, and the dHBP limits 
the number of possible values for the feature-sharing proportions, but allows uncertainty over these 
values. 



5 Inference using Markov chain Monte Carlo sampling 

Given a dataset X = {x_i}, i = 1, ..., N, and a latent feature model P(x_i | z_i, θ) with parameter
θ, the goal of inference is to compute the joint posterior over the customer assignment matrix C,
the dd-IBP hyperparameter α, and likelihood parameter θ, as given by Bayes' rule:

$$ P(C, c^*, \theta, \alpha \mid X, D, f) \propto P(X \mid C, c^*, \theta)\, P(\theta)\, P(C \mid D, f)\, P(c^* \mid \alpha)\, P(\alpha), \tag{23} $$

where the first term is the likelihood (recall that Z is a deterministic function of C and c*), the
second term is the prior over parameters, the third term is the dd-IBP prior over the connectivity
matrix C, the fourth term is the prior over the ownership vector c*, and the last term is the prior
over α. In what follows, we assume that x_i is conditionally independent of z_j and x_j for j ≠ i
given z_i and θ.

Exact inference in this model is computationally intractable. We therefore use MCMC sampling
[27] to approximate the posterior with L samples. The algorithm can be adapted to different
datasets by choosing an appropriate likelihood function. In the next section, we show how to adapt
this algorithm to a simple linear-Gaussian model.

Our algorithm combines Gibbs and Metropolis updates. For Gibbs updates, we sample a variable 
from its conditional distribution given the current states of all the other variables. Conjugacy allows 
simple Gibbs updates for θ and α. Because the dd-IBP prior is not conjugate to the likelihood, we
use the Metropolis algorithm to sample C and c*. We generate proposals for C and c*, and then 
accept or reject them based on the likelihood ratio. We further divide these updates into two cases: 
updates for "owned" (active) dishes and updates of dish ownership. 

Sampling θ. To sample the likelihood parameter θ, we draw from the following conditional
distribution:

$$ P(\theta \mid X, C, c^*) \propto P(X \mid C, c^*, \theta)\, P(\theta), \tag{24} $$

where the prior and likelihood are problem-specific. To obtain a closed-form expression for this
conditional distribution, the prior and likelihood must be conjugate. For non-conjugate priors, one
can use alternative updates, such as Metropolis-Hastings or slice sampling [27]. Generally, updates
for θ will be decomposed into separate updates for each component of θ. In some cases, θ can be
marginalized analytically; an example is presented in the next section.

Footnote: Matlab software implementing this algorithm is available at the first author's homepage:
www.princeton.edu/~sjgershm

Sampling α. To sample the hyperparameter α, we draw from the following conditional distribution:

$$ P(\alpha \mid c^*, D, f) \propto P(\alpha) \prod_{i=1}^{N} \mathrm{Poisson}(\lambda_i; \alpha / h_i), \tag{25} $$

where λ_i is determined by c* and the prior on α is a Gamma distribution with shape ν_α and inverse
scale η_α. Using the conjugacy of the Gamma and Poisson distributions, the conditional distribution
over α is given by:

$$ \alpha \mid c^*, D, f \sim \mathrm{Gamma}\left(\nu_\alpha + \sum_{i=1}^{N} \lambda_i,\; \eta_\alpha + \sum_{i=1}^{N} h_i^{-1}\right). \tag{26} $$
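The conjugate update for α in Eq. 26 can be sketched in a few lines. This is an illustrative sketch rather than the authors' Matlab implementation: the function name `sample_alpha` and the example values of λ_i and h_i are hypothetical, and the only modeling assumption is the Gamma(shape, inverse scale) prior stated above.

```python
import numpy as np

def sample_alpha(lam, h, shape0, rate0, rng):
    """Draw alpha from its Gamma conditional (Eq. 26).

    lam    : number of dishes owned by each customer (lambda_i)
    h      : per-customer normalizers h_i (positive)
    shape0 : prior shape (nu_alpha)
    rate0  : prior inverse scale (eta_alpha)
    """
    lam = np.asarray(lam, dtype=float)
    h = np.asarray(h, dtype=float)
    shape = shape0 + lam.sum()
    rate = rate0 + (1.0 / h).sum()
    # NumPy parameterizes the Gamma by scale, i.e. 1 / (inverse scale).
    return rng.gamma(shape, 1.0 / rate)

# Illustrative values only.
lam = np.array([3, 1, 0, 2, 0])
h = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
alpha = sample_alpha(lam, h, shape0=1.0, rate0=1.0, rng=np.random.default_rng(0))
```

The posterior mean of this draw is (ν_α + Σ λ_i) / (η_α + Σ 1/h_i), which is a convenient sanity check when debugging a sampler.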



Sampling assignments for owned dishes. We update customer assignments for owned dishes
(corresponding to "active" features) using Gibbs sampling. For n = 1, ..., N, i = 1, ..., N, and
k ∈ K_n, we draw a sample from the conditional distribution over c_ik given the current state of all
the other variables:

$$ P(c_{ik} \mid c_{-i}, x_i, c^*, \theta, D, f) \propto P(x_i \mid C, c^*, \theta)\, P(c_{ik} \mid D, f), \tag{27} $$

where x_i is the ith row of X, c_i is the ith row of C, and c_{-i} is C excluding row i. The first factor
in Eq. 27 is the likelihood, and the second factor is the prior, given by P(c_ik = j | D, f) = a_ij. In
considering possible assignments of c_ik, one of two scenarios will occur: Either data point i reaches
the owner of k (in which case feature k becomes active for i as well as for all other data points that
reach i), or it does not (in which case feature k becomes inactive for i as well as for all other data
points that reach i). This means we only need to consider two different likelihoods when updating
c_ik.
Footnotes to Eq. 27: We rely on several conditional independencies in this expression; for example,
x_i is conditionally independent of X_{-i} given C, c*, and θ. In calculating the likelihood, we only
include the active columns of Z (i.e., those for which Σ_{i=1}^N z_ik > 0).

Sampling dish ownership. We update dish ownership and customer assignments for newly
owned dishes (corresponding to features going from inactive to active in the sampling step) using
Metropolis sampling. Both a new connectivity matrix C' and ownership vector c*' are proposed
by drawing from the prior, and then accepted or rejected according to a likelihood ratio. In more
detail, the update proceeds as follows.

1. Propose λ'_i ~ Poisson(α/h_i) for each data point i = 1, ..., N.

2. Set C' ← C. Then populate or depopulate it by performing, for each i = 1, ..., N,

(a) If λ'_i > λ_i, insert λ'_i - λ_i new dishes into K_i. Then, for all k ∈ [λ_i + 1, λ'_i] and m ≠ i,
sample c'_mk according to P(c'_mk = j) = a_mj.

(b) If λ'_i < λ_i, remove λ_i - λ'_i randomly selected dishes from K_i.

This reallocation of dishes induces a new ownership vector c*'.

3. Compute the acceptance ratio ζ. Because the prior (conditional on the current state of the
Markov chain) is being used as the proposal distribution, the acceptance ratio reduces to a
likelihood ratio (the prior and proposal terms cancel out):

$$ \zeta = \min\left[1, \frac{P(X \mid C', c^{*\prime}, \theta)}{P(X \mid C, c^*, \theta)}\right]. \tag{28} $$

4. Draw r ~ Bernoulli(ζ). Set C ← C' and c* ← c*' if r = 1, otherwise leave C and c*
unchanged.
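The accept/reject decision in steps 3 and 4 can be sketched as below. This is a minimal illustration working with log-likelihoods, which is an implementation choice assumed here rather than something stated in the text; `metropolis_accept` is a hypothetical name.

```python
import numpy as np

def metropolis_accept(loglik_proposed, loglik_current, rng):
    """Return (accepted, zeta) for a proposal drawn from the prior.

    With the prior as proposal, the acceptance ratio is
    zeta = min(1, likelihood(proposed) / likelihood(current)).
    """
    log_ratio = loglik_proposed - loglik_current
    zeta = float(np.exp(min(0.0, log_ratio)))   # min(1, ratio), computed stably
    accepted = bool(rng.random() < zeta)         # r ~ Bernoulli(zeta)
    return accepted, zeta

rng = np.random.default_rng(0)
accepted, zeta = metropolis_accept(loglik_proposed=-10.0, loglik_current=-12.0, rng=rng)
```

A proposal that increases the likelihood is always accepted; one that decreases it is accepted with probability equal to the likelihood ratio.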

Iteratively applying these updates, the sampler will (after a burn-in period) draw samples from
a distribution that approaches the posterior (Eq. 23) as the burn-in period grows large. The
time complexity of this algorithm is dominated by the reachability computation, O(KN^2), and the
likelihood computation, which is O(N^3) if coded naively (see [16] for a more efficient implementation
using rank-one updates).
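To make the reachability computation concrete, the sketch below determines, for a single dish, which customers reach its owner by following links. It assumes each customer holds exactly one link per dish (possibly a self-link) and that the owner always has the dish; the function name `reaches_owner` is hypothetical, and this memoized traversal is one possible implementation, not necessarily the one whose cost is quoted above.

```python
import numpy as np

def reaches_owner(links, owner):
    """Return a boolean array: True where a customer reaches `owner` by following links.

    links[i] is the customer that i links to for this dish.
    """
    n = len(links)
    status = np.full(n, -1)          # -1 unknown, 0 does not reach, 1 reaches
    status[owner] = 1                # the owner always holds its own dish
    for start in range(n):
        path, on_path = [], set()
        node = start
        # Follow links until we hit a resolved customer or close a cycle.
        while status[node] == -1 and node not in on_path:
            path.append(node)
            on_path.add(node)
            node = links[node]
        result = status[node] if status[node] != -1 else 0   # cycle without the owner
        for visited in path:
            status[visited] = result
    return status.astype(bool)

z_column = reaches_owner(links=[0, 0, 1, 3, 3], owner=0)
```

Applying this to every owned dish yields the columns of Z from C and c*.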



6 A linear-Gaussian model



As an example of how the dd-IBP can be used in data analysis, we incorporate it into a
linear-Gaussian latent feature model (Figure 6). This model was originally studied for the IBP by
Griffiths and Ghahramani [16, 17]. The observed data X ∈ R^{N×M} consist of N objects, each of which
is an M-dimensional vector of real-valued object properties. We model X as a linear combination of
binary latent features corrupted by Gaussian noise:

$$ X = ZW + \epsilon, \tag{29} $$

where W is a K × M matrix of real-valued weights, and ε is an N × M matrix of independent,
zero-mean Gaussian noise terms with standard deviation σ_x. We place a zero-mean Gaussian prior on
W with covariance σ_w^2 I. Intuitively, the weights capture how the latent features interact to produce
the observed data. For example, if each latent feature corresponds to a person in an image, then
the weight w_km captures the contribution of person k to pixel m.

Figure 6: Linear-Gaussian model. Matrix multiplication view of how latent features (Z) combine
with a weight matrix (W) and white noise (ε) to produce observed data (X).

Within the algorithm of the previous section, θ = W. As a consequence of our Gaussian
assumptions, W can be marginalized analytically, yielding the likelihood:

$$ P(X \mid Z) = \int P(X \mid Z, W)\, P(W)\, dW = \frac{\exp\left\{-\frac{1}{2\sigma_x^2} \mathrm{tr}\left(X^\top (I - Z H^{-1} Z^\top) X\right)\right\}}{(2\pi)^{NM/2}\, \sigma_x^{(N-K)M}\, \sigma_w^{KM}\, |H|^{M/2}}, \tag{30} $$

where tr(·) is the matrix trace and H = Z^T Z + (σ_x^2/σ_w^2) I. In calculating the likelihood, we only
include the "active" columns of Z (i.e., those for which Σ_{i=1}^N z_ik > 0), and K is the number of
active columns.
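The collapsed likelihood in Eq. 30 is straightforward to evaluate in log space. The sketch below is a naive implementation (no rank-one updates) and is not taken from the authors' software; the function name `log_marginal_likelihood` is hypothetical.

```python
import numpy as np

def log_marginal_likelihood(X, Z, sigma_x, sigma_w):
    """Evaluate log P(X | Z) from Eq. 30 with W marginalized out."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    Z = Z[:, Z.sum(axis=0) > 0]                  # keep only the active columns
    N, M = X.shape
    K = Z.shape[1]
    H = Z.T @ Z + (sigma_x ** 2 / sigma_w ** 2) * np.eye(K)
    _, logdet_H = np.linalg.slogdet(H)
    ZtX = Z.T @ X
    # tr(X^T (I - Z H^{-1} Z^T) X) = tr(X^T X) - tr((Z^T X)^T H^{-1} (Z^T X))
    quad = np.sum(X * X) - np.sum(ZtX * np.linalg.solve(H, ZtX))
    return (-quad / (2.0 * sigma_x ** 2)
            - 0.5 * N * M * np.log(2.0 * np.pi)
            - (N - K) * M * np.log(sigma_x)
            - K * M * np.log(sigma_w)
            - 0.5 * M * logdet_H)

rng = np.random.default_rng(0)
Z_demo = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
X_demo = rng.normal(size=(4, 3))
value = log_marginal_likelihood(X_demo, Z_demo, sigma_x=0.5, sigma_w=1.3)
```

One way to check such code is to compare against the equivalent form in which each column of X is Gaussian with covariance σ_x^2 I + σ_w^2 Z Z^T; the two agree by the matrix inversion and determinant lemmas.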



7 Experimental results 



In this section we report experimental investigations of the dd-IBP and comparisons with alternative 
models. We first show how the dd-IBP can be used as a dimensionality reduction pre-processing 
technique for classification tasks when the data points are non-exchangeable. We then show how 
the dd-IBP can be applied to missing data problems. 



7.1 Dimensionality reduction for classification 

The performance of supervised learning algorithms is often enhanced by pre-processing the data to 
reduce its dimensionality [3]. Classical techniques for dimensionality reduction, such as principal 
components analysis and factor analysis, assume exchangeability, as does the infinite latent feature 
model based on the IBP [16]. For this reason, these techniques may not work as well for
pre-processing non-exchangeable data, and this may adversely affect their performance on supervised
learning tasks. 

We investigated this hypothesis using a magnetic resonance imaging (MRI) data set collected from 
27 patients with Alzheimer's disease and 35 healthy controls [8]. The observed features consist
of 4 structural summary statistics measured in 56 brain regions of interest: (1) surface area; (2) 
shape index; (3) curvedness; (4) fractal dimension. The classification task is to sort individuals 
into Alzheimer's or control classes based on their observed features. 

Footnote: The MRI data are available at:
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_July2009_ID_NI

Figure 7: Trace plots. Representative traces of the log joint probability of the Alzheimer's data
and latent variables for the dd-IBP (top) and dHBP (bottom). Each iteration corresponds to a
sweep over all the latent variables.

Age-related changes in brain structure produce natural declines in cognitive function that make
diagnosis of Alzheimer's disease difficult ^3j. Thus, it is important to take age into account
when designing predictive models. For the dd-IBP and dHBP, age is naturally incorporated as a
covariate over which we constructed a distance matrix. Specifically, we defined dij as the absolute
age difference between subjects i and j, with dij = ∞ for j > i (i.e., the distance matrix is
sequential). This induces a prior belief that individuals with similar ages tend to share more latent
features. In the MRI data set, ages ranged from 60 to 90 (median: 76.5).
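The sequential age-based distance matrix can be constructed as in the sketch below. Two points are assumptions made for illustration: the exponential decay is taken to be f(d) = exp(-βd) with f(∞) = 0, and each row is normalized so it can be read as link probabilities. The function name `sequential_age_proximity` is hypothetical, and the result depends on the order in which subjects are indexed.

```python
import numpy as np

def sequential_age_proximity(ages, beta):
    """Build the sequential distance matrix D and a row-normalized proximity matrix A."""
    ages = np.asarray(ages, dtype=float)
    n = len(ages)
    D = np.abs(ages[:, None] - ages[None, :])    # absolute age differences
    D[np.triu_indices(n, k=1)] = np.inf          # d_ij = infinity for j > i
    finite = np.isfinite(D)
    F = np.zeros_like(D)
    F[finite] = np.exp(-beta * D[finite])        # assumed decay; infinite distance maps to 0
    A = F / F.sum(axis=1, keepdims=True)         # rows sum to 1
    return D, A

D, A = sequential_age_proximity([60.0, 70.0, 65.0], beta=0.1)
```

With β = 0 every finite distance receives equal weight, so each row of A is uniform over subjects j ≤ i.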

In detail, we ran 1500 iterations of MCMC sampling on the entire data set using the linear-Gaussian 
observation model, and then selected the latent features of the maximum a posteriori sample as 
input to a supervised learning algorithm (L2-regularized logistic regression, with the regularization 
constant set to 10~^). Training was performed on half of the data, and testing on the other half.
The noise hyperparameters of the dd-IBP and dHBP (σ_x and σ_w) were updated using
Metropolis-Hastings proposals.

We monitored the log of the joint distribution, log P(X, α, C, c*). Visual inspection of the log joint
probability traces suggested that the sampler reaches a local maximum within 400-500 iterations
(Figure 7). This process was repeated for a range of decay parameter (β) values, using the
exponential decay function. The same proximity matrix, A, was used for both the dd-IBP and dHBP.
We performed 5 random restarts of the sampler and recomputed the classification measure for each
restart, averaging the resulting measures to reduce sampling variability. For comparison, we also
made predictions using the standard IBP, the dIBP [29], and the raw observed features (i.e., no
pre-processing). The dIBP was fit using the MCMC algorithm described in Williamson et al. [29],
which adaptively samples the parameters controlling dependencies between observations (thus the
results do not depend on β).

Footnote: A few randomly chosen individuals were removed from the test set to make it balanced.

Figure 8: Classification results for Alzheimer's data set. Area under the curve (AUC) for
binary classification (Alzheimer's vs. normal control) using L2-regularized logistic regression and
features learned from a linear-Gaussian latent feature model. Each curve represents a different
choice of predictor variables (latent features) for logistic regression. The x-axis corresponds to
different settings of the exponential decay function parameter, β. "Raw" refers to the original data
features (see text for details); the IBP, dd-IBP, dIBP and dHBP results were based on using the
latent features of the maximum a posteriori sample following 1500 iterations of MCMC sampling.

Classification results are shown in Figure 8 (left), where performance is measured as the area under
the receiver operating characteristic curve (AUC). Chance performance corresponds to an AUC of
0.5, perfect performance to an AUC of 1. For a range of β values, the dd-IBP produces superior
classification performance to the alternative models, with performance increasing as a function of
β. The dHBP performs worse relative to the raw data for low β values. The magnitude of the
standard error (across random restarts) is roughly 1/10 that of the means.

We also ran the dd-IBP sampler with β = 0 (in which case the dd-IBP and IBP are equivalent) and
found no significant difference between it and the standard IBP sampler with respect to performance 
on the Alzheimer's classification and the EEG reconstruction (see next section). 



7.2 Reconstructing missing data 



As an example of a missing data problem, we use latent feature models to reconstruct missing
observations in electroencephalography (EEG) time series. The EEG data are from a visual
detection experiment in which human subjects were asked to count how many times a particular
image appeared on the screen [18]. The data were collected as part of a larger effort to design
brain-computer interfaces to assist physically disabled subjects.

Footnote: The EEG data are available at: http://mmspl.epfl.ch/page33712.html

Figure 9: Reconstruction of missing EEG data. Reconstruction error for latent feature models
as a function of the exponential decay function parameter, β. Results were based on the maximum
a posteriori sample following 1500 iterations of MCMC sampling. Lower values indicate better
performance.

Distance between data points was defined using the absolute time-difference. Data were z-scored 
prior to analysis. For 10 of the data points, we removed 2 of the observed features at random. 
We then ran the MCMC sampler for 1500 iterations, adding Gibbs updates for the missing data 
by sampling from the observation distribution (Eq. 30) conditional on the current values of the
latent features and hyperparameters. We then used the MAP sample for reconstruction. We
measured performance by the squared reconstruction error on the missing data. Figure 9 shows the
reconstruction results, demonstrating that the dd-IBP is effective for reconstructing missing data 
in this dataset and achieves lower reconstruction error than the alternative models we consider. 



8 Conclusions 

By relaxing the exchangeability assumption for infinite latent feature models, the dd-IBP extends 
their applicability to a richer class of data. We have shown empirically that this innovation fares 
better than the standard IBP on non-exchangeable data (e.g., time series).

We note that the dd-IBP is not a standard Bayesian nonparametric distribution, in the sense 
of arising from a de Finetti mixing distribution. For the standard IBP, the de Finetti mixing 
distribution has been identified as the beta process [28] , but this result does not generalize to the 
dd-IBP due to its non-exchangeability (a consequence of de Finetti's theorem). Nonetheless, this 
does not detract from our model's ability to let the data infer the number of latent features, a
property that it shares with other infinite latent feature models.

A number of future directions are possible. First, it may be possible to exploit distance dependence 
to derive more efficient samplers. In particular, Doshi-Velez and Ghahramani [10] have shown that
partitioning the data into subsets enables faster Gibbs sampling for the traditional IBP; the window 
decay function imposes a natural partition of the data into conditionally independent subsets. 

Second, application of the dd-IBP to other likelihood functions is straightforward. For example, it 
could be applied to relational data jlHl IH] or text data [28]. As pointed out by Miller et al. [22],
covariates like age or location often play an important role in link prediction. Whereas Miller et al.
[22] incorporated covariates into the likelihood function, one could instead incorporate them into
the prior by defining covariate-based distances between data points (e.g., the age difference between 
two people). A distinction of the latter approach is that it would allow one to model dependencies 
in terms of latent features. For instance, two people close in age or geographic location may be 
more likely to share latent interests, a pattern naturally captured by the dd-IBP. 

Third, modeling shared dependency structure across groups is important for several applications. 
In fMRI and EEG studies, for example, similar spatial and temporal dependencies are frequently 
observed across subjects. Modeling shared structure without sacrificing intersubject variability has 
been addressed with hierarchical models [21 [30]. One way to extend the dd-IBP hierarchically would 
be to allow the parameters of the decay function to vary across individuals while being coupled 
together by higher-level variables.

Appendix: proofs

Recall that R_i = Σ_{k=1}^∞ z_ik is the number of features held by data point i, and
R_ij = Σ_{k=1}^∞ z_ik z_jk is the number of features shared by data points i and j, where i ≠ j.

Proposition 1

Under the dd-IBP,

$$ R_i \sim \mathrm{Poisson}\left(\alpha \sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1)\right), \tag{31} $$

$$ R_{ij} \sim \mathrm{Poisson}\left(\alpha \sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1, \mathcal{L}_{jn} = 1)\right). \tag{32} $$

Proof. For each feature, there is some probability π_i that it is turned on for data point i, and some
probability π_ij that it is shared by i and j (note that features are exchangeable in the dd-IBP).
The total number of features across all data points is distributed according to Poisson(λ), where
λ = α Σ_{n=1}^N h_n^{-1}. Since each feature is turned on independently, the total number of active
features for a single data point i is distributed according to R_i ~ Poisson(λ π_i). Similarly, the total
number of features shared by data points i and j is R_ij ~ Poisson(λ π_ij). The activation and
co-activation probabilities are given by:

$$ \pi_i = \sum_{n=1}^{N} P(c^* = n)\, P(\mathcal{L}_{in} = 1) = \frac{\sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1)}{\sum_{j=1}^{N} h_j^{-1}}, \tag{33} $$

$$ \pi_{ij} = \sum_{n=1}^{N} P(c^* = n)\, P(\mathcal{L}_{in} = 1, \mathcal{L}_{jn} = 1) = \frac{\sum_{n=1}^{N} h_n^{-1} P(\mathcal{L}_{in} = 1, \mathcal{L}_{jn} = 1)}{\sum_{j=1}^{N} h_j^{-1}}, \tag{34} $$

where we have used that P(c* = n) = h_n^{-1} / Σ_{j=1}^N h_j^{-1}. □

Proposition 2

If B_0 is continuous, then under the dHBP,

$$ R_i \mid g_{1:N} \sim \mathrm{Poisson}(\gamma), \tag{35} $$

$$ R_{ij} \mid g_{1:N} \sim \begin{cases} \mathrm{Poisson}\left(\gamma\, \frac{c_0 + c_1 + 1}{(c_0 + 1)(c_1 + 1)}\right) & \text{if } g_i = g_j, \\ \mathrm{Poisson}\left(\frac{\gamma}{c_0 + 1}\right) & \text{if } g_i \neq g_j. \end{cases} \tag{36} $$



Proof. We write the random measures B and B*_j in the generative model defining the dHBP in
Section 3.2 as the following mixtures over point masses:

$$ B = \sum_{k=1}^{\infty} p_k \delta_{\omega_k}, \quad p_k \sim \mathrm{Beta}(0, c_0), \quad \omega_k \sim B_0, \tag{37} $$

$$ B^*_j = \sum_{k=1}^{\infty} p^*_{jk} \delta_{\omega_k}, \quad p^*_{jk} \sim \mathrm{Beta}(c_1 p_k,\, c_1 (1 - p_k)). \tag{38} $$

Recall that X_i ~ BeP(B*_{g_i}), where g_i ~ Multinomial(a_i).

Let z_ik be the random variable that is 1 if the Bernoulli process draw X_i has atom ω_k, and 0 if
not. We have z_ik ~ Bernoulli(p*_{g_i k}). Because B_0 is continuous, P(ω_k = ω_k') = 0 for k ≠ k' and
the random variables R_i and R_ij satisfy

$$ R_i = \sum_{k=1}^{\infty} z_{ik} \quad \text{and} \quad R_{ij} = \sum_{k=1}^{\infty} z_{ik} z_{jk}. \tag{39} $$

We first show that R_i is Poisson distributed with mean γ.

Let q_i(ε) denote the probability that X_i has atom ω_k conditioned on p_k > ε (this value does not
depend on k). That is, q_i(ε) = P(z_ik = 1 | p_k > ε).

For a given ε, the density of p_k conditioned on p_k > ε is:

$$ P(p_k \in dp \mid p_k > \epsilon) = \frac{c_0 p^{-1} (1 - p)^{c_0 - 1}}{\int_{\epsilon}^{1} c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp}\, dp, \quad p \in (\epsilon, 1). \tag{40} $$

We can use this density to calculate the success probability q_i(ε):

$$ q_i(\epsilon) = E[z_{ik} \mid p_k > \epsilon] = E[p^*_{g_i k} \mid p_k > \epsilon] = E[p_k \mid p_k > \epsilon] = \frac{\int_{\epsilon}^{1} p\, c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp}{\int_{\epsilon}^{1} c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp}, \tag{41} $$

where we have used the tower property of conditional expectation in the second and third equalities.

For a given ε > 0, let N_ε denote the number of atoms in B with p_k > ε. This number is
Poisson-distributed with mean λ_ε = γ ∫_ε^1 c_0 p^{-1} (1 - p)^{c_0 - 1} dp.

Let R_i(ε) be the number of such atoms that are also in X_i. Because R_i(ε) is the sum of N_ε
independent Bernoulli trials that each have success probability q_i(ε), it follows that
R_i(ε) | N_ε ~ Binomial(N_ε, q_i(ε)) and

$$ R_i(\epsilon) \sim \mathrm{Poisson}(\lambda_\epsilon q_i(\epsilon)). \tag{42} $$

Because R_i = lim_{ε→0} R_i(ε), it follows that R_i ~ Poisson(lim_{ε→0} λ_ε q_i(ε)), where

$$ \lim_{\epsilon \to 0} \lambda_\epsilon q_i(\epsilon) = \lim_{\epsilon \to 0} \gamma \int_{\epsilon}^{1} c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp \cdot \frac{\int_{\epsilon}^{1} p\, c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp}{\int_{\epsilon}^{1} c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp} $$

$$ = \lim_{\epsilon \to 0} \gamma \int_{\epsilon}^{1} p\, c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp = \gamma c_0 \int_{0}^{1} (1 - p)^{c_0 - 1}\, dp = \gamma, $$

where we have used that ∫_0^1 (1 - p)^{c_0 - 1} dp = 1/c_0. Thus R_i ~ Poisson(γ).

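We perform a similar analysis to show the distribution of R_ij. Let q_ij(ε) denote the probability
that X_i and X_j share atom ω_k conditional on p_k > ε, g_i and g_j. That is,

$$ q_{ij}(\epsilon) = P(z_{ik} = z_{jk} = 1 \mid g_i, g_j, p_k > \epsilon). \tag{43} $$

Although only ε appears in the argument of q_ij(ε), this quantity also implicitly depends on g_i and
g_j. We calculate q_ij(ε) explicitly below.

Let R_ij(ε) be the number of atoms ω_k for which p_k > ε and ω_k is in both X_i and X_j. We have
R_ij(ε) | N_ε, g_i, g_j ~ Binomial(N_ε, q_ij(ε)) and

$$ R_{ij}(\epsilon) \mid g_i, g_j \sim \mathrm{Poisson}(\lambda_\epsilon q_{ij}(\epsilon)). \tag{44} $$

Because R_ij = lim_{ε→0} R_ij(ε), it follows that R_ij ~ Poisson(lim_{ε→0} λ_ε q_ij(ε)).

To calculate lim_{ε→0} λ_ε q_ij(ε), we consider two cases. In each case, we first calculate q_ij(ε) and
then calculate the limit, showing that it is the same as the mean of R_ij claimed in the statement of
the proposition.

Case 1: g_i = g_j

$$ q_{ij}(\epsilon) = E[z_{ik} z_{jk} \mid p_k > \epsilon, g_i, g_j] = E[(p^*_{g_i k})^2 \mid p_k > \epsilon, g_i, g_j] $$

$$ = E\big[E[(p^*_{g_i k})^2 \mid p_k, g_i, g_j] \,\big|\, p_k > \epsilon, g_i, g_j\big] = E\left[\frac{p_k (c_1 p_k + 1)}{c_1 + 1} \,\Big|\, p_k > \epsilon, g_i, g_j\right] $$

$$ = \frac{\int_{\epsilon}^{1} \frac{p (c_1 p + 1)}{c_1 + 1}\, c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp}{\int_{\epsilon}^{1} c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp} = \frac{c_0}{c_1 + 1} \cdot \frac{\int_{\epsilon}^{1} (c_1 p + 1)(1 - p)^{c_0 - 1}\, dp}{\int_{\epsilon}^{1} c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp}. $$

Then the limit lim_{ε→0} λ_ε q_ij(ε) can be written

$$ \lim_{\epsilon \to 0} \lambda_\epsilon q_{ij}(\epsilon) = \gamma \frac{c_0}{c_1 + 1} \int_{0}^{1} (c_1 p + 1)(1 - p)^{c_0 - 1}\, dp $$

$$ = \gamma \frac{c_0}{c_1 + 1} \left[ c_1 \int_{0}^{1} p (1 - p)^{c_0 - 1}\, dp + \int_{0}^{1} (1 - p)^{c_0 - 1}\, dp \right] $$

$$ = \gamma \frac{c_0}{c_1 + 1} \left[ \frac{c_1}{c_0 (c_0 + 1)} + \frac{1}{c_0} \right] = \gamma \frac{c_0 + c_1 + 1}{(c_0 + 1)(c_1 + 1)}, $$

where we have used that ∫_0^1 (1 - p)^{c_0 - 1} dp = 1/c_0 and ∫_0^1 p (1 - p)^{c_0 - 1} dp = 1/(c_0 (c_0 + 1)).

Case 2: g_i ≠ g_j

$$ q_{ij}(\epsilon) = E[z_{ik} z_{jk} \mid p_k > \epsilon, g_i, g_j] = E\big[E[p^*_{g_i k}\, p^*_{g_j k} \mid p_k, g_i, g_j] \,\big|\, p_k > \epsilon, g_i, g_j\big] $$

$$ = E[p_k^2 \mid p_k > \epsilon, g_i, g_j] = \frac{\int_{\epsilon}^{1} p^2 c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp}{\int_{\epsilon}^{1} c_0 p^{-1} (1 - p)^{c_0 - 1}\, dp}. $$

Then the limit lim_{ε→0} λ_ε q_ij(ε) can be written

$$ \lim_{\epsilon \to 0} \lambda_\epsilon q_{ij}(\epsilon) = \gamma c_0 \int_{0}^{1} p (1 - p)^{c_0 - 1}\, dp = \gamma \frac{1}{c_0 + 1}, \tag{45} $$

where we have used that ∫_0^1 p (1 - p)^{c_0 - 1} dp = 1/(c_0 (c_0 + 1)).

□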
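The limiting Poisson means derived in this proof can be checked numerically. The sketch below evaluates the three integrals with a simple midpoint rule; the function names are hypothetical, and the check assumes c_0 ≥ 1 so that the integrands are bounded on (0, 1).

```python
import numpy as np

def midpoint_integral(fn, n=200_000):
    """Approximate the integral of fn over (0, 1) with the midpoint rule."""
    p = (np.arange(n) + 0.5) / n
    return float(np.mean(fn(p)))

def dhbp_limit_means(gamma, c0, c1):
    """Numerically evaluate the Poisson means of R_i and R_ij (same / different group)."""
    def base(p):
        return (1.0 - p) ** (c0 - 1.0)
    mean_i = gamma * c0 * midpoint_integral(base)
    mean_same = gamma * c0 / (c1 + 1.0) * midpoint_integral(lambda p: (c1 * p + 1.0) * base(p))
    mean_diff = gamma * c0 * midpoint_integral(lambda p: p * base(p))
    return mean_i, mean_same, mean_diff

gamma, c0, c1 = 1000.0, 10.0, 1.0
mean_i, mean_same, mean_diff = dhbp_limit_means(gamma, c0, c1)
closed_same = gamma * (c0 + c1 + 1.0) / ((c0 + 1.0) * (c1 + 1.0))
closed_diff = gamma / (c0 + 1.0)
```

Dividing `mean_same` and `mean_diff` by `mean_i` gives the two limiting feature-sharing proportions available under the dHBP.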